
perf: Add mergeHistogram fast path #66

Closed
wants to merge 9 commits

Conversation

@carsonip carsonip commented Aug 2, 2023

Add a fast path for mergeHistogram since it accounts for the most time in CPU profiles. This optimizes for cases where len(from.Buckets) << len(to.Buckets), so that we get O(m lg n) as opposed to the regular O(n + m), where m = len(from.Buckets) and n = len(to.Buckets). This idea comes from #49 but was not implemented in the merged #52.
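
For illustration, the fast path boils down to binary-searching each from bucket into to. Below is a simplified, self-contained sketch (the histogram type and the mergeFastPath name are hypothetical, not this PR's code; the actual change counts missing buckets as extra so it can grow to in one allocation instead of bailing out):

package main

import (
	"fmt"
	"sort"
)

// Simplified stand-in for the real histogram representation: sorted bucket
// keys with a parallel slice of counts.
type histogram struct {
	Buckets []int32
	Counts  []int64
}

// mergeFastPath folds `from` into `to` in O(m lg n) time when every bucket of
// `from` already exists in `to` (m = len(from.Buckets), n = len(to.Buckets)).
// It returns false without touching `to` when some bucket is missing, so the
// caller can fall back to the regular O(n + m) merge.
func mergeFastPath(to, from *histogram) bool {
	if len(from.Buckets) >= len(to.Buckets) {
		return false // heuristic: the fast path only pays off when m << n
	}
	idxs := make([]int, len(from.Buckets))
	for i, b := range from.Buckets {
		j, found := sort.Find(len(to.Buckets), func(k int) int {
			return int(b - to.Buckets[k])
		})
		if !found {
			return false // at least one bucket is missing; use the slow path
		}
		idxs[i] = j
	}
	for i, j := range idxs {
		to.Counts[j] += from.Counts[i]
	}
	return true
}

func main() {
	to := &histogram{Buckets: []int32{1, 3, 5, 7, 9}, Counts: []int64{1, 1, 1, 1, 1}}
	from := &histogram{Buckets: []int32{3, 7}, Counts: []int64{2, 4}}
	fmt.Println(mergeFastPath(to, from), to.Counts) // true [1 3 1 5 1]
}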

benchstat:

goos: linux
goarch: amd64
pkg: github.com/elastic/apm-aggregation/aggregators
cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
                                       │ bench-out/merge-histogram-fastpath-7-before │ bench-out/merge-histogram-fastpath-7-after │
                                       │                   sec/op                    │       sec/op         vs base               │
AggregateCombinedMetrics-16                                              1.893µ ± 6%           1.878µ ± 3%        ~ (p=0.310 n=6)
AggregateBatchSerial-16                                                  4.469µ ± 5%           4.407µ ± 4%        ~ (p=0.394 n=6)
AggregateBatchParallel-16                                                4.968µ ± 6%           4.957µ ± 2%        ~ (p=0.669 n=6)
CombinedMetricsEncoding-16                                               907.6n ± 5%           902.3n ± 3%        ~ (p=0.818 n=6)
CombinedMetricsToBatch-16                                                60.76µ ± 2%           62.81µ ± 2%   +3.38% (p=0.002 n=6)
EventToCombinedMetrics-16                                                398.3n ± 5%           384.2n ± 2%   -3.54% (p=0.002 n=6)
NDJSONSerial/go-2.0.0.ndjson-16                                          16.67m ± 2%           15.33m ± 5%   -8.00% (p=0.002 n=6)
NDJSONSerial/nodejs-3.29.0.ndjson-16                                     9.418m ± 4%           8.933m ± 3%   -5.15% (p=0.004 n=6)
NDJSONSerial/python-6.7.2.ndjson-16                                      23.54m ± 3%           23.39m ± 8%        ~ (p=0.093 n=6)
NDJSONSerial/ruby-4.5.0.ndjson-16                                        13.81m ± 2%           13.31m ± 5%   -3.62% (p=0.041 n=6)
NDJSONParallel/go-2.0.0.ndjson-16                                        16.87m ± 4%           15.02m ± 4%  -10.93% (p=0.002 n=6)
NDJSONParallel/nodejs-3.29.0.ndjson-16                                   9.521m ± 9%           8.890m ± 5%   -6.62% (p=0.002 n=6)
NDJSONParallel/python-6.7.2.ndjson-16                                    23.20m ± 2%           23.00m ± 1%        ~ (p=0.394 n=6)
NDJSONParallel/ruby-4.5.0.ndjson-16                                      14.12m ± 3%           13.28m ± 3%   -5.94% (p=0.002 n=6)
geomean                                                                  397.7µ                384.7µ        -3.28%

                                       │ bench-out/merge-histogram-fastpath-7-before │ bench-out/merge-histogram-fastpath-7-after │
                                       │                    B/op                     │        B/op         vs base                │
AggregateCombinedMetrics-16                                           1.103Ki ± 1%           1.104Ki ± 1%       ~ (p=0.859 n=6)
AggregateBatchSerial-16                                               1.368Ki ± 2%           1.362Ki ± 2%       ~ (p=0.180 n=6)
AggregateBatchParallel-16                                             1.353Ki ± 2%           1.372Ki ± 1%       ~ (p=0.234 n=6)
CombinedMetricsEncoding-16                                              0.000 ± 0%             0.000 ± 0%       ~ (p=1.000 n=6) ¹
CombinedMetricsToBatch-16                                             3.983Ki ± 0%           3.983Ki ± 0%       ~ (p=0.472 n=6)
EventToCombinedMetrics-16                                               0.000 ± 0%             0.000 ± 0%       ~ (p=1.000 n=6) ¹
NDJSONSerial/go-2.0.0.ndjson-16                                       9.237Mi ± 1%           9.219Mi ± 1%       ~ (p=0.394 n=6)
NDJSONSerial/nodejs-3.29.0.ndjson-16                                  5.564Mi ± 1%           5.576Mi ± 1%       ~ (p=0.818 n=6)
NDJSONSerial/python-6.7.2.ndjson-16                                   15.35Mi ± 1%           15.33Mi ± 0%       ~ (p=0.310 n=6)
NDJSONSerial/ruby-4.5.0.ndjson-16                                     8.254Mi ± 0%           8.249Mi ± 0%       ~ (p=0.589 n=6)
NDJSONParallel/go-2.0.0.ndjson-16                                     9.251Mi ± 0%           9.242Mi ± 1%       ~ (p=0.180 n=6)
NDJSONParallel/nodejs-3.29.0.ndjson-16                                5.562Mi ± 1%           5.599Mi ± 1%       ~ (p=0.132 n=6)
NDJSONParallel/python-6.7.2.ndjson-16                                 15.33Mi ± 0%           15.39Mi ± 0%       ~ (p=0.132 n=6)
NDJSONParallel/ruby-4.5.0.ndjson-16                                   8.248Mi ± 0%           8.249Mi ± 0%       ~ (p=0.589 n=6)
geomean                                                                            ²                       +0.14%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                       │ bench-out/merge-histogram-fastpath-7-before │ bench-out/merge-histogram-fastpath-7-after │
                                       │                  allocs/op                  │     allocs/op       vs base                │
AggregateCombinedMetrics-16                                             17.00 ± 0%             17.00 ± 0%       ~ (p=1.000 n=6) ¹
AggregateBatchSerial-16                                                 13.00 ± 0%             13.00 ± 0%       ~ (p=1.000 n=6) ¹
AggregateBatchParallel-16                                               13.00 ± 0%             13.00 ± 8%       ~ (p=0.455 n=6)
CombinedMetricsEncoding-16                                              0.000 ± 0%             0.000 ± 0%       ~ (p=1.000 n=6) ¹
CombinedMetricsToBatch-16                                               95.00 ± 0%             95.00 ± 0%       ~ (p=1.000 n=6) ¹
EventToCombinedMetrics-16                                               0.000 ± 0%             0.000 ± 0%       ~ (p=1.000 n=6) ¹
NDJSONSerial/go-2.0.0.ndjson-16                                        114.0k ± 3%            113.9k ± 2%       ~ (p=1.000 n=6)
NDJSONSerial/nodejs-3.29.0.ndjson-16                                   63.18k ± 1%            63.86k ± 1%       ~ (p=0.310 n=6)
NDJSONSerial/python-6.7.2.ndjson-16                                    162.7k ± 1%            163.3k ± 0%       ~ (p=0.132 n=6)
NDJSONSerial/ruby-4.5.0.ndjson-16                                      98.75k ± 0%            98.77k ± 1%       ~ (p=0.937 n=6)
NDJSONParallel/go-2.0.0.ndjson-16                                      113.8k ± 2%            114.2k ± 1%       ~ (p=0.394 n=6)
NDJSONParallel/nodejs-3.29.0.ndjson-16                                 63.45k ± 1%            64.15k ± 1%  +1.10% (p=0.015 n=6)
NDJSONParallel/python-6.7.2.ndjson-16                                  163.1k ± 1%            162.3k ± 1%       ~ (p=1.000 n=6)
NDJSONParallel/ruby-4.5.0.ndjson-16                                    98.78k ± 0%            98.88k ± 1%       ~ (p=0.485 n=6)
geomean                                                                            ²                       +0.18%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

@carsonip carsonip requested a review from a team as a code owner August 2, 2023 11:47
fromIdx++
var extra int
toLen, fromLen := len(to.Buckets), len(from.Buckets)
if fromLen < toLen { // Heuristics to trade between O(m lg n) and O(n + m).
Member Author

[For reviewer] We can change this to if fromLen < toLen / 2 or any other heuristic, but I assume this is good enough to rule out the edge case where it is 100% impossible to merge without additional slots.

Member Author

Also, the if fromLen < toLen / 2 variant isn't faster in benchmarks.

@carsonip carsonip requested a review from lahsivjar August 2, 2023 11:51
Comment on lines 406 to 416
for fromIdx := 0; fromIdx < fromLen; fromIdx++ {
	toIdx, found := sort.Find(toLen, func(toIdx int) int {
		return int(from.Buckets[fromIdx] - to.Buckets[toIdx])
	})
	if found {
		to.Counts[toIdx] += from.Counts[fromIdx]
		from.Counts[fromIdx] = 0
	} else {
		extra++
	}
}
Contributor

Suggested change
findIn := to
for fromIdx := 0; fromIdx < fromLen; fromIdx++ {
	target := from.Buckets[fromIdx]
	toIdx, found := sort.Find(len(findIn), func(toIdx int) int {
		return int(target - findIn.Buckets[toIdx])
	})
	if toIdx == len(findIn) {
		// every bucket in `findIn` is less than target, so no later
		// (larger) target can be found in `findIn` either
		break
	}
	if found {
		findIn.Counts[toIdx] += from.Counts[fromIdx]
		from.Counts[fromIdx] = 0
	} else {
		extra++
	}
	findIn = findIn[toIdx:] // next target will definitely be greater than the current one
}

We can optimize the search a bit more by taking advantage of from also being sorted. I added sample code above, but please double-check it as I wrote it in a hurry to suggest the idea.

Member Author

oh this is clever

Member Author

@carsonip carsonip Aug 2, 2023

Actually, I have an idea that is applicable to more cases. Instead of searching in 0..toLen each time, we can use the toIdx from the previous iteration and search in 0..toIdx+1. This will require reversing this loop: for fromIdx := 0; fromIdx < fromLen; fromIdx++ (i.e. iterating fromIdx from fromLen-1 down to 0).
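
Roughly, it would look like this (untested sketch, reusing to, from, toLen, fromLen, and extra from the diff above):

// Untested sketch: walk `from` from its largest bucket downwards and reuse
// the previous search result as the upper bound of the next search.
hi := toLen
for fromIdx := fromLen - 1; fromIdx >= 0; fromIdx-- {
	target := from.Buckets[fromIdx]
	toIdx, found := sort.Find(hi, func(i int) int {
		return int(target - to.Buckets[i])
	})
	if found {
		to.Counts[toIdx] += from.Counts[fromIdx]
		from.Counts[fromIdx] = 0
	} else {
		extra++
	}
	// the next (smaller) target can only land at or before toIdx
	if hi = toIdx + 1; hi > toLen {
		hi = toLen
	}
}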

Member Author

Moreover, in your code we must ensure extra > 0 when we break, and we'll also have to correctly determine extra. These short-circuits are dangerous 😢

			if toIdx == len(findIn) {
				break
			}

Member Author

Updated the PR to limit the search space with the help of the result of the previous iteration.

Contributor

Looking at this again, I think it might be better to just optimise the merge for exactly the same buckets in to and from; if any of the buckets differ, we fall back to the previous logic. I think it will be simpler and give us the same benefits with slightly better performance. WDYT?

Member Author

@carsonip carsonip Aug 2, 2023

if any of the buckets differ, we fall back to the previous logic. I think it will be simpler and give us the same benefits with slightly better performance.

Yes, we can fall back to the previous logic, but we'll still have to pick between O(n+m) and O(m lg n) for counting the extra buckets, so we can grow the slice in one go. Since in a lot of cases m << n, having O(m lg n) just for counting may still be faster than O(n+m). IMO the performance isn't that clear cut. Am I missing anything?

Member Author

@carsonip carsonip Aug 2, 2023

Ah right, since the merge is O(n+m), the runtime of the whole function where we do binary search without falling back immediately becomes max(O(m lg n), O(n+m)). Yes, in theory, if we fall back early there is slightly better performance. I'll see how I can incorporate this.
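
A rough sketch of that structure (hypothetical, untested; inside the merge function, before the existing logic):

// If the bucket layouts are identical, merge counts elementwise in O(n);
// otherwise fall back to the existing merge.
sameBuckets := len(from.Buckets) == len(to.Buckets)
for i := 0; sameBuckets && i < len(from.Buckets); i++ {
	sameBuckets = from.Buckets[i] == to.Buckets[i]
}
if sameBuckets {
	for i, c := range from.Counts {
		to.Counts[i] += c
	}
	return
}
// fall back to the previous O(n + m) merge below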


carsonip commented Aug 2, 2023

Closing this in favor of #69

@carsonip carsonip closed this Aug 2, 2023